AD-A227  105 


OHS  FILE  COPY 


d> 


NASA  Contractor  Report  182081 
ICASE  Report  No.  90-53 


ICASE 


PARALLELIZATION  OF  IMPLICIT  FINITE  DIFFERENCE 
SCHEMES  IN  COMPUTATIONAL  FLUID  DYNAMICS 


Naomi  H.  Decker 
Vijay  K.  Naik 
Michel  Nicoules 


Contract  No.  NAS  1-18605 
August  1990 

Institute  for  Computer  Applications  in  Science  and  Engineering 
NASA  Langley  Research  Center 
Hampton,  Virginia  23665-5225 

Operated  by  the  Universities  Space  Research  Association 


DT1C 

SELECTE  •% 
OCT  03 19901  1 


NASA 

National  Aeronautics  and 
Spam  Administration 

Langley  Research  Center 

Hampton,  Virginia  ?3665-5??5 


*0  16 


4B 


q$? 


DISTRIBUTION  STATEMENTA 

Approved  for  public  release; 
Distribution  Unlimited 


Parallelization  of  Implicit  Finite  Difference  Schemes  in  Computational  Fluid  Dynamics 


Naomi  H.  Decker  *  Vijay  K.  Naik 

ICASE,  NASA  Langley  Research  Center  T.  J.  Watson  Research  Center 
Hampton,  VA  23665.  Yorktown  Heights,  NY  10598. 

Michel  Nicoules  * 

T.  J.  Watson  Research  Center 
Yorktown  Heights,  NY  10598. 


Abstract 

Implicit  finite  difference  schemes  are  often  the  preferred  numerical  schemes  in  com¬ 
putational  fluid  dynamics,  requiring  less  stringent  stability  bounds  than  the  explicit 
schemes.  Each  iteration  in  an  implicit  scheme,  however,  involves  global  data  depen¬ 
dencies  in  the  form  of  second  and  higher  order  recurrences.  Efficient  parallel  imple¬ 
mentations  of  such  iterative  methods,  therefore,  are  considerably  more  difficult  and 
non-intuitive.  In  this  paper,  we  consider  the  parallelization  of  the  implicit  schemes  that 
are  used  for  solving  the  Euler  and  the  thin  layer  Navier-Stokes  equations  and  that  re¬ 
quire  inversions  of  large  linear  systems  in  the  form  of  block  tri-diagonal  and/or  block 
penta-diagonal  matrices.  We  focus  our  attention  on  three-dimensional  cases  and  present 
schemes  that  minimize  the  total  execution  time.  We  describe  partitioning  and  schedul¬ 
ing  schemes  for  alleviating  the  effects  of  the  global  data  dependencies.  An  analysis  of 
the  communication  and  the  computation  aspects  of  these  methods  is  presented.  The 
effect  of  the  boundary  conditions  on  the  parallel  schemes  is  also  discussed.  The  ARC-3D 
code,  developed  at  NASA  Ames,  is  used  as  an  example  application.  Performance  of  the 
proposed  methods  is  verified  on  the  Victor  multiprocessor  system  which  is  a  message 
passing  architecture  developed  at  the  IBM,  T.  J.  Watson  Research  Center.  A 
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1  Introduction 

Efficient  parallel  implementation  of  any  iterative  method  depends  on  the  characteristics  of  both  the 
particular  application  problem  to  be  solved  and  the  architecture  of  the  parallel  machine  on  which 
it  is  to  be  solved.  Using  iterative  methods  to  find  the  numerical  solution  of  a  partial  differential 
equation  with  boundary  conditions  generally  involves  more  than  just  the  repeated  solutions  of  linear 
systems  of  the  form 

Ax  =  b.  (1) 

The  elements  of  A  and  b  must  also  be  evaluated  at  every  iteration.  Depending  on  the  equations 
and  the  boundary  conditions,  these  calculations  can  be  a  significant  part  of  the  solution  cost.  In 
many  cases  the  data  dependencies  of  the  iterative  method  and  those  in  evaluating  the  elements  of 
A  and  b  are  considerably  different.  Therefore  load  balancing  and  communication  requirements  vary 
from  one  part  of  the  iterative  process  to  the  next.  Thus,  evaluating  the  parallelism  of  an  iterative 
method  by  considering  just  the  cost  of  solving  Equation  1  is  not  necessarily  a  good  indicator  of  the 
parallelism  of  the  corresponding  iterative  method  for  solving  the  partial  differential  equation. 

In  this  paper  we  consider  some  of  the  issues  involved  in  the  efficient  parallelization  of  iterative 
schemes  in  the  overall  context  of  a  computational  fluid  dynamics  (CFD)  application.  In  particular, 
we  examine  a  class  of  algorithms  known  as  the  implicit  schemes  for  their  parallelization  properties 
on  message  passing,  MIMD  multiprocessor  systems.  We  show  here  that,  with  appropriate  rear¬ 
rangement  of  computation  and  by  using  suitable  partitioning  strategy,  one  can  extract  high  degree 
of  parallelism  from  these  schemes.  The  ARC-3D  code,  developed  at  NASA  Ames,  is  used  as  a 
model  application.  Experimental  results  are  presented  from  the  implementations  on  the  Victor 
multiprocessor  system  developed  at  IBM,  T.  J.  Watson  Research  Center. 

The  organization  of  the  rest  of  the  paper  is  as  follows.  In  the  next  section,  various  iterative 
methods,  explicit  and  implicit,  used  in  the  CFD  applications  are  surveyed  and  their  important 
properties  are  discussed.  The  three  dimensional  Euler  and  the  thin-layer  Navier-Stokes  equations 
for  modeling  the  compressible  flow  of  a  gas  over  a  solid  body  are  described  in  Section  3.  The 
numerical  methods  used  for  solving  these  model  equations  are  considered  in  the  remainder  of  the 
paper.  The  implicit  schemes  that  are  of  interest  are  described  in  Section  4.  In  Section  5,  these 
schemes  are  broadly  classified  into  three  categories  and  the  parallel  properties  of  each  class  is 
analyzed.  The  salient  points  of  the  ARC-3D  code  and  the  implementation  aspects  are  described  in 
Sections  6,  7  and  8.  Results  are  presented  in  Section  9  and  finally  Section  10  concludes  the  paper. 

2  Numerical  Methods  in  CFD 

Whether  using  finite-element,  finite-difference,  finite- volume  or  spectral  methods,  efficient  methods 
for  solving  the  time  dependent  Euler  or  Navier-Stokes  equations  typically  involve  two  basic  types  of 
time  stepping  schemes:  the  explicit  and  the  implicit.  The  most  easily  parallelizable  and  relatively 
inexpensive  are  the  explicit  schemes  which  can  be  performed  simultaneously  at  every  grid  element, 
often  with  highly  localized  spatial  data  dependencies.  However,  these  explicit  methods,  with  a 
localized  spatial  domain  of  dependence,  are  characterized  by  a  time  step  restriction  imposed  by 
the  Courant-Friedrichs-Lewy  (CFL)  condition.  The  implicit  schemes,  on  the  other  hand,  are  not 
restricted  by  the  CFL  condition  and  are  often  advantageous  for  stiff  problems  which  contain  several 
time  scales  or  for  solving  steady  state  problems  using  the  time  dependent  equations  as  a  device  for 
the  iterative  solution  of  the  steady  state  equations.  The  time  step  of  an  implicit  method  may  be 
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restricted  by  the  need  to  maintain  a  desired  level  of  accuracy  or  to  maintain  numerical  stability.  In 
general,  the  reduction  in  the  number  of  time  steps  for  an  implicit  method  must  be  weighed  against 
the  increase  in  the  cost  of  the  implicit  calculations.  On  uni-processor  and  vector  architectures, 
the  implicit  schemes  have  proven  to  be  superior  in  many  instances.  For  multiprocessor  systems,  in 
addition  to  the  above  mentioned  factors,  the  data  dependencies  must  also  be  taken  into  account. 
Because  of  the  inherent  global  spatial  data  dependencies  that  generally  require  the  solution  of 
large  sparse  linear  systems,  the  implicit  methods  are  less  amenable  to  parallelization.  In  this 
paper  we  show  that  by  suitable  rearrangement  of  the  computation  steps  and  by  using  appropriate 
communication  and  data  partitioning  schemes,  it  is  possible  to  retain  the  efficiency  of  the  implicit 
schemes  on  parallel  systems  as  well. 

Examples  of  explicit  time  stepping  schemes  are  the  Lax  Wendroff  methods  such  as  MacCor- 
mack’s  predictor/corrector  scheme  (MacCormack,  [16], [17]),  the  linear  multistep  methods  such  as 
leap  frog  and  Adams-Bashforth,  and  the  one  step  multistage  schemes  such  as  Runge-Kutta  (Jame¬ 
son,  Schmidt,  Turkel,  [12]).  Note  that,  regardless  of  the  explicit  or  implicit  time  stepping  scheme 
used,  many  of  the  other  calculations  involved  in  typical  CFD  codes,  such  as  evaluating  the  residuals 
or  the  pressure  have  data  dependencies  similar  to  the  explicit  schemes. 

Implicit  time  stepping  schemes  appear  in  a  variety  different  methods.  These  include  the  Ap¬ 
proximate  Factorization  (AF)  /  Alternating  Direction  Implicit  (ADI)  methods  (Bredif,  [5];  Beam, 
Warming,  [3];  Abarbanel,  Dwoyer,  Gottlieb,  [1])  and  the  closely  related  Fractional  Step  methods 
(Yanenko,[29])  and  the  LU  implicit  methods  (Jameson,  Turkel,  [13]).  In  recent  years,  various  Jacobi 
and  Gauss-Seidel  type  iterative  methods  have  been  applied  to  solve  the  flux-split  equations,  either 
using  the  Newton  linearization,  (Chakravarthy,[7])  or  the  switched  evolution/relaxation  (SER) 
method,  (Mulder  and  Van  Leer, [27]).  Implicit  methods  are  also  used  to  precondition  minimum 
residual  or  conjugate  gradient  algorithms,  for  example,  the  Incomplete  LU  (Meijerink,  Van  der 
Vorst,  [18])  and  the  Strongly  Implicit  Procedure  (Stone,  [25]).  Most  of  these  methods  have  also 
been  used  as  smoothers  for  multigrid  methods  or  in  conjunction  with  explicit  schemes.  (See  [4], 
[9],  [24],  [26].)  Finally,  implicit  calculations,  in  the  form  of  residual  averaging,  are  used  in  the 
FLO  codes,  (Jameson,  [11]),  to  stabilize  and  accelerate  the  convergence  of  an  explicit  Runge-Kutta 
multigrid  method. 

3  Equations 

We  consider  implicit  finite  difference  schemes  for  the  three  dimensional  Euler  and  thin-layer  Navier- 
Stokes  equations  which  model  the  compressible  flow  of  a  gas  over  a  solid  body.  The  domain  in 
Cartesian  coordinates,  x,  y  and  z  is  mapped  onto  a  computational  domain  with  a  general  curvilinear 
coordinate  transformation  given  by 

r  =  t,  £  =  £(x,y,z,t),  r)  =  t?(i ,y,z,t),  (  =  ((x,y,z,t). 

The  surface  of  the  solid  body  is  assumed  to  be  on  a  plane  given  by  (  =  constant.  In  these  coordi¬ 
nates,  the  time-dependent  thin-layer  Navier-Stokes  equations  are 


and  p ,  pu,  pv,  pw  and  e  are,  respectively,  the  density,  the  x,  y ,  and  z  Cartesian  components  of 
momentum  and  the  total  energy  per  unit  volume.  The  vector  fluxes  are  given  by, 


pu  r  pv  i  r  Pw 

puU  +  ZxP  puV  +  T]xp  puW  +  CxP 

E  =  J-1  pvU  +  iyP  ,  F  -  pvV  +  T) yP  ,  G  =  J-1  pvW  +  (vp 

pwU  +  £zP  pwV  +  T)zp  pwW  +  (zp 

,U(e  +  p)-(tp  J  [v(e  +  p)-Vtp\  [  W(e  +  p)-(tp 

with  the  contravariant  velocities,  U ,  V,  and  W,  defined  as: 

U  =  +  £zw 

V  =  T)t  +  VxU  +  VyV  +  T]z  W 
w  =  +  Cvv  +  Czw 

and  J  is  the  metric  Jacobian. 

For  a  perfect  gas,  the  pressure,  p,  and  speed  of  sound,  a,  are  given  by 

P  =  (7  -  1)(«  -  \p{^  +  v2  +  w2)) 

a2  =  — 

P 

where  7  is  the  ratio  of  specific  heats. 

The  thin-layer  viscous  term  on  the  right  hand  side  of  the  equation  is  given  by 

0 

pmiu^  +  (/i/3  )m2Cx 
S  =  J~l  pmiV{  +  (p/3)m2(v 

fimiW(  +  (n/3)m2(z 
.  fimi m3  +  (p/3)m2((xv  +  C vv  +  (zw) 

where 

rn.  =  C2  +  CJ  +  CJ 

m2  =  CiuC  +  CvvC  + 
m3  =  (u2  4-  u2  +  w2)/2  +  Pr_1(7  -  l)_i(a2)(. 

The  parameters  p,  Re  and  Pr  are  the  dynamic  viscosity,  Reynolds  number  and  Prandtj  number, 
respectively. 


3 


The  metric  terms  can  be  obtained  by  chain  rule  from  the  definitions  of  £,  17  and  (.  The  curvilinear 


derivatives  in  terms  of  the  Cartesian  derivatives  are  given  by: 

N* 

1 

N 

II 

(y  =  J(ZT)XC  ~  zCxv ) 

6  =  J(xr,yc  - 

Vx  =  J(hvc  -  zcyc) 

Vv  =  J{x(ZC  -  X(Z() 

Vz  =  J{y (xc  -  ycxd 

Ci  =  J( y(*v  ~  Vnzi) 

Cy  =  J{ztxri  —  ZT)xt) 

(z  =  J -  X r,y() 

6 

=  -Xrtx-yr  Cy-Z-rCz 

Vt 

=  ~xrVx  -  VtVv  -  zrVt 

Ct 

=  -XrCx  -VrCy-ZrCz 

and  J  is  given  by, 


J  1  =  +  X(y(Zr,  +  X^Zf  -  -  x„y(z(  -  x(yr,z(. 

The  equations  presented  above  describe  Navier-Stokes  equations  in  three  dimensions  after  ap¬ 
plying  the  thin-layer  approximation  theory.  The  Euler  equations,  for  inviscid  flow,  are  obtained  by 
setting  the  viscous  term  on  the  right  hand  side  of  Equation  2  to  zero. 


4  Implicit  Methods 

A  numerical  solution  of  Equation  2  can  be  obtained  by  discretizing  in  both  time  and  space.  We  first 
describe  the  time  discretizations  which  lead  to  the  implicit  schemes.  A  single  step  time  discretization 
of  Equation  2  can  be  written  as 


in+1 


-Qn 


tn+1  -  tr 


,dEn+1  dFn+1  dGn+1  „  ,dSn+\ 
a(  at  +  aZ  +  -37"  ~  R'  ~J^~) 


dt 


dr] 


d( 

„  s/dEn  dFn  dGn  „ 

-(!  «)(  d£  +  dr]  +  d(-  -  Re  Q(- 


d( 
_1dSn 


)• 


Here,  Qn  is  the  state  of  the  system  at  time  step  tn,  and,  for  example, 


En  =  E{Qn). 


A  Taylor’s  series  expansion  of  the  vector  flux  terms  and  the  thin  layer  viscous  flux  term  can  be 
used  to  linearize  the  (n  -f  1)  terms  giving 


En+i  %  En+  An(Qn+1 

Fn+l  ^  pn  +  5n(gn+l 

Gn+ 1  5»Gn  +  C"(Qn+1 
Sn+l  «  Sn  +  M"(Qn+1 


-Qn) 
- Qn ) 
- Qn ) 
-Qn). 


A,  B,  C  and  M,  the  5x5  matrices,  are  the  flux  Jacobians,  given  by  and  |j|,  respectively, 

and  A Qn  =  Qn+1  -  Qn.  These  linearized  implicit  equations  can  be  written  in  the  “delta  form”  as: 


A  Q 


(3) 
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A  ,8En  8Fn  8GT 

^~df  +  ~^T  +  ~d( 


—  Re 


.ldST 


d( 


■)• 


The  choice  of  the  parameter  a  in  the  above  equations  determines  the  implicit  scheme.  When 
q  is  equal  to  one,  this  is  a  first-order  backward  Euler  scheme,  a  =  1/2  is  a  second-order  Crank- 
Nicolson,  or  trapezoidal  scheme,  and  a  =  0  is  forward  Euler,  an  explicit  scheme.  In  the  limit,  as 
At  approaches  infinity,  and  if  a  =  1,  this  becomes  a  linearized  Newton’s  method. 

We  write  Equation  3  more  simply  as: 


(7  +  aAtN)AQ  =  -AtR 


(4) 


where 


N 


and  the  explicit  terms  on  the  right  hand  side  are  given  by 

„  8En  8Fn  dGn  _ 
R  =  -ttt-  +  — —  +  Tri  -  Re 


. l8Sr 


8£  8t]  8( 


d( 


The  space  discretization  is  performed  on  a  uniform  grid  in  the  generalized  coordinate  system. 
Using  second  order  central  space  differencing  of  the  five  coupled  equations  with  lexicographic  or¬ 
dering  of  the  grid  points,  Equation  4  results  in  a  large  (5  X  5)  block  banded  linear  system.  For 
three  dimensional  problems,  the  bandwidth  is  proportional  to  the  number  of  grid  points  in  a  plane. 
Therefore,  the  direct  solution  of  the  above  system  is  impractical. 

Without  changing  the  accuracy  of  the  scheme,  the  bandwidth  of  the  linear  system  can  be 
reduced  by  factoring  I  +  aAtN .  The  factored  system  can  then  be  solved  with  a  direct  inversion 
of  each  of  the  factors.  For  example,  in  the  Approximate  Factorization  (AF)  methods,  such  as  the 
Beam- Warming  scheme,  the  linear  operator  is  written  as  the  product  of  three  factors,  one  for  each 
coordinate  direction.  Thus,  7  +  aAtN  is  factored  as 


Q  Q  a  Q 

(7  +  aAt~An)(I  +  aAt— Bn)(I  +  aAt^-Cn  -  aAtRe~'  —Mn)AQn  =  -AtRn.  (5) 

drj  d(  d( 

Central  finite  differencing  of  each  factor,  yields  a  block  tri-diagonal  matrix,  where  the  off-diagonal 
5x5  blocks  are  dense.  An  alternative  to  the  three  way  factorization  is  an  LU  factorization.  Here, 
I  +  aAtN  is  factored  so  that  Equation  4  becomes: 


LUAQn  =  -AtR 

where  L  contains  only  backward-differenced  and  U  contains  only  forward-differenced  discretizations 
of  the  spatial  derivatives.  Ordering  the  grid  points  accordingly,  L  and  U  become  lower  and  upper 
triangular  matrices.  The  bandwidth  of  these  matrices  is  as  large  as  that  of  the  unfactored  system, 
but  now  the  inversion  can  be  accomplished  in  just  two  sequential  sweeps  over  the  grid  points. 

Iterative  methods  can  also  be  used  to  solve  the  unfactored  system.  Various  relaxation  or  splitting 
methods  can  be  obtained  by  approximating  data  at  off-diagonals  by  old  data  (at  time  <n).  Jacobi, 
Gauss-Seidel  and  SSOR  methods  can  be  used,  in  their  many  three  dimensional  forms.  For  example, 
the  simplest  is  a  point  Jacobi  method,  where  values  at  each  point  are  updated  using  old  values  at  all 
surrounding  grid  points.  A  more  elaborate  method  is  a  red/black(checkerboard)-line  Gauss-Seidel, 
obtained  by  performing  tridiagonal  solves  on  all  ‘red’  lines  in  a  plane,  using  the  old  data  at  the 
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‘black’  points,  followed  by  tridiagonal  solves  on  all  black  lines,  using  the  new  values  at  the  red 
points.  This  could  be  done  first  in  the  £  -  rj  plane,  then  in  the  r]  -  (  plane,  and  then  in  the  (  -  f 
plane. 

No  matter  which  method,  direct  or  iterative,  it  is  the  data  dependencies  which  determine  the 
available  parallelism.  In  the  next  section,  all  of  these  methods  are  grouped  into  three  distinct  data 
dependence  categories. 

5  Parallelization  of  Implicit  Methods 

The  schemes  described  above,  although  all  are  implicit,  exhibit  different  degrees  of  parallelism. 
In  general,  the  extent  to  wnich  any  of  these  methods  is  parallelizable  can  often  be  predicted  by 
examining  the  nature  of  the  inter-grid  data  dependencies  in  going  from  one  time  step  to  the  next. 
We  find  it  convenient  to  classify  the  methods  into  three  categories  based  on  the  dependencies:  the 
parallel-by-point  schemes,  the  parallel-by-line  schemes,  and  the  parallel-by-plane  schemes.  These 
categories  in  effect  characterize  the  granularity  of  the  available  parallelism  and,  quantitatively,  the 
degree  of  the  extractable  parallelism.  As  a  consequence,  the  implementation  techniques  for  all  the 
schemes  within  a  category  and  their  expected  performance  are  similar. 


Parallel-by-point 

Multistage  methods 
(e.g.,  Runge-Kutta) 
Lax-Wendroff  methods 
(e.g.,  MacCormack) 
Multistep  methods 
(e.g.,  leap  frog) 
Point  Jacobi 


Parallel-by-line 

ADI/AF  methods 

(e.g.,  Beam- Warming) 
Fractional  step  methods 

Red/black-line  Gauss-Seidel 

Line  Jacobi 


Parallel-by-plane 
Implicit  LU 
ILU  and  SIP 
Point/line  Gauss-Seidel 
SSOR 


Figure  1:  A  classification  of  implicit  methods  based  on  available  parallelism. 


In  the  following,  we  describe  the  main  properties  of  the  three  categories  using  a  three-dimensional 
n  X  n  x  n  grid  as  an  example.  Note  that  the  properties  are  unaffected  even  if  the  grid  is  non- 
uniform  and  even  if  it  has  unequal  number  of  grid  points  in  each  of  the  three  dimensions. 

The  ‘parallel-by-point’  schemes  are  the  simplest.  To  advance  to  the  next  time  step,  computa¬ 
tions  can  be  done  at  each  grid  point  independently  of  the  computations  at  the  other  grid  points. 
Depending  on  the  discretization  stencil  used,  values  from  the  previous  time  step  at  one  or  more 
neighboring  grid  points  are  needed  in  the  computations  at  any  grid  point.  The  time-explicit  schemes 
are  all  examples  of  ‘parallel-by-point’  schemes.  A  few  of  the  time-implicit  schemes  also  fit  in  this 
category,  for  example,  point  Jacobi  on  the  unfactored  equations.  The  ‘parallel-by-point’  schemes 
are  extensively  analyzed  in  the  literature  for  partitioning  and  for  minimizing  the  communication 
overhead.  (See,  for  example,  Reed  et  al.,  [23].)  On  an  n  x  n  x  n  grid,  up  to  n3  proces- 
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sors  can  be  used  without  introducing  any  computational  sequentiality.  Any  processor  assignment 
scheme  which  allocates  approximately  the  same  number  of  points  to  each  processor  gives  a  good 
first  approximation  to  balancing  the  computation  load.  On  distributed  memory  architectures,  a 
static  three  dimensional  block  partitioning  of  the  grid  is  usually  sufficient  to  minimize  the  com¬ 
munication  overhead,  provided  the  interprocessor  communication  network  provides  at  least  a  three 
dimensional  connectivity.  In  practice,  two  dimensional  partitioning  schemes  also  suffice  in  reducing 
the  communication  costs  close  to  a  minimum.  Improved  load  balancing  of  the  computation  and 
minimization  of  the  communication  overhead  is  usually  achieved  by  an  analysis  of  the  effect  of  the 
data  dependencies  at  the  boundaries. 


parallel-by-point  parallel-by-line  parallel-by-plane 


Figure  2:  Types  of  parallelism. 

‘Parallel-by-line’  schemes  are  a  little  more  difficult.  To  advance  to  the  next  time  step,  the 
computation  at  each  grid  point  is  coupled  to  other  grid  points  in  a  fixed  coordinate  direction,  but 
can  be  done  independently  of  the  computations  at  grid  points  on  all  other  lines  in  that  coordinate 
direction.  In  other  words,  in  the  ‘parallel- by-line’  schemes  the  dependencies  are  such  that  all  the 
grid  points  on  a  line  must  be  updated  together.  The  computations  on  different  lines,  however,  may 
proceed  without  any  dependence  or  communication  delays.  The  quasi-one-dimensional  algorithms 
such  as  ADI,  AF  and  fractional  step  algorithms  are  in  this  category,  as  well  as  some  of  the  line 
relaxation  methods  listed  in  Figure  1.  For  an  n  X  n  x  n  grid,  there  are  n2  grid  lines  in 
each  of  the  three  dimensions  and  hence,  as  long  as  there  are  n2  or  less  processors,  the  optimal 
sequential  algorithms  can  usually  be  modified  to  give  a  good  parallel  algorithm.  With  more  than 
n 2  processors,  the  solution  process  must  usually  be  ‘blocked  out’.  That  is,  each  processor  performs 
as  many  independent  (parallel)  computations  on  local  data  as  possible  before  combining  information 
with  other  processors.  Having  reduced  the  number  of  coupled  equations,  the  remaining  equations 
are  solved  by  the  same  method  by  a  subset  of  the  processors.  If  more  than  one  line  is  to  be  solved 
simultaneously,  the  rest  of  the  processor  continue  the  reduction  on  other  lines.  A  typical  example 
of  this  is  when  performing  tridiagonal  line  solves.  The  sequentially  optimal  Thomas  algorithm  can 
be  replaced  by  a  substructuring/cyclic  reduction  algorithm  to  extract  more  parallelism.  Note  that 


the  cyclic  reduction  algorithms  require  more  than  twice  as  many  floating  point  operations  per  grid 
point  as  the  Thomas  algorithm.  Moreover,  for  these  algorithms  the  dependencies  fan  out  requiring 
richer  interprocessor  connectivity  to  reduce  the  communication  overhead.  The  cyclic  reduction 
type  algorithms,  however,  have  the  advantage  that  more  of  these  operations  can  be  done  in  parallel 
(Heller,  [8];  Ho  and  Johnsson,  [10]). 

The  third  category  of  ‘parallel-by-plane’  involves  dependencies  that  span  two  or  more  dimen¬ 
sions  and  are  more  difficult  to  handle  than  the  other  two  cases.  LU  factorization  and  Gauss-Seidel 
methods  are  examples  of  these  methods.  Extracting  parallelism  from  LU  factorization  and  Gauss- 
Seidel  methods  is  possible,  but  is  much  more  difficult,  see  (Naik,[19]).  For  a  three  dimensional 
problem,  these  methods  are  vectorizable  along  j  +  k  -f  l  =  constant  planes,  but  the  vector  length 
varies.  Similarly,  the  parallelization  could  be  done  in  planes,  at  the  cost  of  inefficient  processor  uti¬ 
lization  near  the  corners  of  the  computational  domain.  Because  of  the  nonlinearity  of  the  equations, 
the  triangular  LU  factors  are  different  at  every  point  and  at  every  time  step.  The  parallelization 
methods  of  Anderson  and  Saad  [2],  which  involve  preprocessing  overhead  which  must  be  amortized 
over  many  iterations  to  be  worthwhile,  are  not  useful  in  this  context.  Because  of  their  improved 
convergence  properties  and  increased  stability,  these  methods  warrant  further  investigation. 

For  the  remainder  of  this  paper,  we  consider  the  parallelization  of  a  ‘parallel-by-line’  scheme: 
the  Beam-Warming  Approximate  Factorization  scheme.  The  AF  methods  have  been  extensively 
studied.  They  can  be  made  to  be  total  variation  diminishing  (TVD),  (Chakravarthy  et  al.,[6]) 
and  the  boundary  conditions  are  well  understood,  (LeVeque,[15]).  The  AR.C-3D  code  developed  at 
NASA  Ames  is  based  on  this  scheme.  In  the  following  ,  we  first  briefly  described  the  key  aspects 
of  ARC-3D,  referring  the  reader  for  details  to  Pulliam  and  Steger,  [22],  and  then  discuss  possible 
parallel  implementations.  In  Section  9  we  describe  our  implementation  on  the  Victor  multiprocessor 
system  at  IBM,  Yorktown  Heights. 

6  ARC-3D 

A  widely  distributed,  general  purpose  implicit  finite  difference  code  is  the  ARC-3D  code  developed 
at  NASA  Ames,  (Pulliam  and  Steger,  [21]).  It  is  based  on  the  Beam- Warming  Approximate  Fac¬ 
torization  scheme,  described  in  Section  4.  The  numerical  solution  of  the  factored  equation  involves 
inverting  the  three  factors  shown  in  Equation  5.  Since  each  of  the  three  factors  represents  a  recur¬ 
sive  coupling  of  information  along  a  particular  spatial  direction,  the  inversion  of  a  factor  has  global 
data  dependencies,  usually  in  the  form  of  block  tridiagonal  matrices,  in  that  direction. 

Although  the  computation  costs  associated  with  Approximate  Factorization  scheme  are  consid¬ 
erably  less  than  direct  solves,  they  are  still  expensive.  A  significant  reduction  (40%)  in  computation 
can  be  effiieved  using  the  diagonalization  of  the  blocks  in  the  implicit  operator,  as  developed  by 
Pulliam  and  Chaussee  [20].  The  flux  Jacobians,  An,  Bn  and  Cn  in  Equation  5  can  be  diagonalized 
as: 

Aj  =  T~l  AnT^-  A  ~  T~^ BnT  •  A^  =  CnT^ 

For  steady  state  calculations,  a  modified  factored  algorithm  for  the  Euler  equations,  which  is  now 
at  most  first  order  accurate  in  time,  is  given  by 

T((I  +  AtdtA()N{I  +  AtdrjA^Pil  +  AtdcA()T^AQn  =  Rn 

with  N  =  T^lTn  and  P  =  T~ 1 .  For  the  explicit  form  of  the  A’s,  T”s,  N  and  P ,  see  (Pulliam  and 
Steger, [22]).  A  fourth  order  finite  difference  discretization  of  the  implicit  factors  yields  scalar  penta- 
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diagonal  and  (5  X  5)  block  diagonal  factors.  Thus,  the  numerical  solution  using  the  diagonalization 
technique  consists  of  inverting  seven  factors,  four  of  which  are  in  (5  X  5)  block  diagonal  form  and 
the  other  three,  which  correspond  to  the  three  spatial  directions,  are  in  scalar  penta-diagonal  form. 
As  in  the  block  tri-diagonal  case,  the  inversion  of  the  scalar  penta-diagonal  matrices  has  global 
dependencies  in  a  particular  spatial  direction,  but  now  the  five  variables  at  each  grid  point  are 
decoupled. 

We  consider  here,  two  versions  of  ARC-3D  code  for  examining  the  implementation  and  perfor¬ 
mance  issues  in  parallelizing  some  of  the  implicit  schemes  described  and  characterized  in  Sections  4 
and  5.  The  first  version  corresponds  to  the  standard  way  of  computing  the  approximate  factors 
and  is  based  on  the  solution  of  the  block  tri-diagonal  matrices.  We  refer  to  this  version  of  ARC-3D 
as  the  block  tri-diagonal  version.  The  other  version  considered  here  is  based  on  the  diagonalization 
technique  described  above.  This  version  of  ARC-3D  is  referred  to  as  the  scalar  penta-diagonal  ver¬ 
sion.  In  the  following,  we  describe  the  computation  steps  and  the  corresponding  data  dependencies 
in  the  two  versions. 


penta-diagonal 

block  tridiagonal 

+ 

X 

■f 

V 

+ 

X 

-r 

RIIS 

290 

356 

37 

3 

290 

356 

37 

3 

IMPLICIT 

-  setup 

235 

410 

26 

6 

389 

512 

18 

0 

-  forward  solves 

66 

99 

6 

0 

495 

510 

117 

0 

-  back  solves 

30 

30 

0 

0 

75 

75 

0 

0 

TOTAL 

331 

539 

32 

6 

959 

1097 

135 

0 

Table  1:  Floating  point  operations  per  grid  point,  non-viscous  case. 


The  implementation  of  the  Beam- Warming  scheme  in  generalized  coordinates  involves  a  se¬ 
quence  of  separate  tasks.  These  tasks  include  an  initialization  part  where  the  computations  corre¬ 
sponding  to  initial  setup  are  made.  This  is  followed  by  computations  over  each  time  step.  Each 
time  step  requires  the  solution  of  Equation  5  to  find  the  correction,  AQn.  The  computation  at 
each  time  step  is  accomplished  numerically  by  solving  either  a  (5  X  5)  block  tridiagonal  system 
or  by  solving  a  scalar  penta-diagonal  system  for  each  line,  in  each  of  the  three  spatial  directions. 
Thus,  for  an  n  x  n  x  n  grid,  there  are  n2  such  systems  to  be  solved  for  each  direction.  At  the  com¬ 
putational  level,  however,  four  separate  tasks  are  necessary  for  advancing  a  time  step.  First,  the 
discrete  boundary  conditions  must  be  calculated  at  all  boundaries.  The  inflow,  outflow  and  solid 
body  boundary  conditions  must  be  set,  and  any  symmetry  or  singular  conditions  related  to  the 
computational  domain  must  be  enforced.  Then  the  fluxes,  numerical  viscosity  and  the  thin  layer 


viscous  terms  on  the  right  hand  side  of  Equation  5  are  evaluated,  using  the  appropriate  difference 
formulas.  Next,  in  the  block  tri-diagonal  version,  each  of  the  three  5x5  coefficient  matrices  must 
be  evaluated  at  every  grid  point  in  the  domain  and  then  and  only  then  can  each  factor  be  inverted. 
For  the  scalar  penta-diagonal  version,  each  of  the  five  scalar  coefficients  of  the  penta-diagonal  sys¬ 
tem  must  be  evaluated  at  every  grid  point  for  each  variable  before  the  penta-diagonal  factor  can 
be  inverted.  Finally,  the  flow  variables,  Q,  must  be  updated: 

Q n+1  =  Qn  +  AQn. 

We  ’■efer  to  these  four  tasks  as  BC,  RHS,  IMPLICIT  and  UPDATE,  respectively. 

In  general,  a  typical  application  involves  thousands  of  time  steps.  This  amortizes  the  setup  cost 
over  the  many  iterations  and  hence  the  initialization  cost  is  insignificant.  The  work  m,rolved  in 
UPDATE  is  no  more  than  five  independent  floating  point  additions  at  each  grid  point  and  hence 
is  also  negligible.  So  we  restrict  out  attention  to  the  three  major  tasks,  BC,  RHS  and  IMPLICIT. 

The  data  dependencies  of  each  of  these  three  operations  are  quite  different.  In  BC,  only  the 
subdomain  consisting  of  boundary  points  are  updated  using  values  from  only  those  interior  points 
which  are  close  to  the  boundary  points.  The  pressure  on  the  solid  body  is  determined  via  the  normal 
momentum  equation,  and  because  central  differences  of  tangential  derivatives  of  the  pressure  are 
used,  scalar  tridiagon.il  inversions  must  be  performed  along  the  solid  body.  RHS  is  explicit  and  is 
therefore  completely  data  parallel,  but  some  of  the  calcu'ations  must  be  done  at  all  grid  points  with 
the  final  computation  at  only  the  interior  points.  Finally,  IMPLICIT  involves  either  the  inversion 
of  (5x5)  block  tri-diagonal  systems  or  scalar  penta-diagonal  systems  in  each  of  the  three  coordinate 
directions. 

The  number  of  floating  point  operations  per  grid  point  for  RHS  and  IMPLICIT  in  the  non- 
viscous  case  are  shown  in  Table  1  for  the  two  versions  of  ARC-3D.  The  number  of  floating  point 
operations  per  grid  point  in  BC  are  relatively  small  compared  to  those  in  RHS  and  IMPLICIT  and 
for  that  reason  they  are  not  shown.  Clearly,  for  both  versions  of  ARC-3D,  the  cost  of  IMPLICIT 
is  the  dominant  cost. 

7  Algorithms 

To  bring  out  the  salient  points  for  parallel  implementations,  we  now  describe  the  three  inain  tasks 
of  ARC-3D  at  the  algorithmic  level.  First,  the  algorithmic  aspects  of  IMPLICIT  are  presented  in 
some  detail  and  then  those  for  RHS  and  BC  are  briefly  described.  The  task  of  IMPLICIT  consists 
of  solving  linear  systems  by  inverting  either  block  tri-diagonal  or  scalar  penta-diagonal  matrices. 
In  the  block  tri-diagonal  case,  for  each  direction  as  many  systems  need  to  be  solved  as  there  are 
lines  of  grid  points  in  that  direction.  In  the  scalar  penta-diagonal  version,  for  each  variable  and  for 
each  direction,  the  number  of  systems  solved  is  equal  to  the  number  of  lines  of  grid  points  in  that 
direction.  In  addition  to  inverting  scalar  penta-diagonal  matrices,  this  version  requires  inversion  of 
block  diagonal  matrices.  However,  the  inverse  of  each  block  is  known  analytically,  and  the  block 
diagonal  inversions  can  be  done  without  requiring  information  from  any  other  grid  p«_mt.  The 
algorithms  used  for  solving  block  tri-diagonals  and  scalar  penta-diagonals  are  similar  to  those  used 
for  inverting  scalar  tri-diagonals,  which  we  describe  next. 

Sequentially,  the  most  efficient  algorithm  for  inverting  a  tri-diagonal  system  is  the  Thomas 
algorithm  which  is  a  special  case  of  the  Gaussian  elimination.  To  solve  a  system  of  size  n,  with 
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coefficients  A,  B  and  C  and  right  hand  side  /,  given  by, 

AjUj—i  t  BjUj  -|-  Cjuj+1  —  fj 

where  u0  and  un+1  are  known,  the  Thomas  algorithm  performs  two  sweeps  over  the  unknowns  u\ 
through  un.  In  the  forward  sweep,  coefficients  P3+i  and  qj+i,  j  —  1,  •  •  • ,  n,  are  computed,  where 

^i+1  =  (Bj-AjPjr'Cj 

+  i  =  ( ~  AjPj)  (fj  —  Ajq j). 

The  unknowns  Uj ,  j  —  n,  ■  ■  ■ ,  1,  are  determined  in  the  back  sweep  from 

Uj  =  qJ+ 1  —  PJ+  j  v.j+ 1 . 

Note  that  the  computation  requires  storage  of  all  1  and  qJ+x  from  the  forward  sweep  for  com¬ 
puting  the  unknowns  u}  in  the  back  sweep.  Clearly,  both  phases  of  the  Thomas  algorithm  have 
recursive  data  dependencies  and  hence  the  algorithm,  when  used  for  solving  a  single  system,  is 
sequential.  However,  when  several  independent  such  systems  need  to  be  solved  as  in  the  case  of 
IMPLICIT,  the  computations  of  these  systems  can  be  pipelined.  Thus,  the  above  algorithm  can 
be  parallelized  by  assigning  to  each  processor  parts  of  computations  from  several  independent  tri¬ 
diagonal  system  solves.  In  such  a  scheme,  if  a  processor  is  to  compute  unknowns  uJp  through  Ukp 
from  one  of  the  systems,  then  it  receives  the  coefficients  and  q]p  from  a  preceding  processor 
and  then  proceeds  to  compute  the  coefficients  through  Pkp  and  qkp  and  sends  the  final  coefficients 
Pkp  and  qkp  to  the  succeeding  processor.  The  same  is  repeated  for  all  other  systems  assigned  to 
it.  After  completing  all  the  forward  sweeps,  the  back  sweeps  are  performed.  First  the  values  of 
x£fcp  + 1  are  received  for  each  system  from  the  succeeding  processor  and  back  sweep  is  completed  by 
computing  the  unknowns  ukp  through  Uj  .  The  computed  value  of  Uj  is  sent  to  the  preceding 
processor  and  the  process  is  repeated  for  all  the  systems  assigned  to  that  processor.  This  we  refer 
to  as  the  pipelined  version  of  the  Thomas  algorithm.  Note  that  all  the  coefficients  from  all  the 
forward  sweeps  must  be  stored  until  the  back  sweeps  are  completed. 

The  tri-diagonals  can  also  be  computed  using  the  substructuring  or  the  cyclic  reduction  algo¬ 
rithms  (See,  e.g.,  Ho  and  Johnsson  [10]).  However,  these  schemes  are  computationally  expensive. 
The  cost  of  Thomas  algorithm  for  scalar  tri-diagonals  is  approximately  8  floating  point  operations, 
whereas  the  cyclic  reduction  type  algorithms  require  about  17  floating  point  operations  per  vari¬ 
able.  The  Thomas  algorithm  is  a  sequential  algorithm  and  if  only  one  tri-diagonal  system  is  to 
be  solved,  then  the  available  of  parallelism  is  negligible.  The  cyclic  reduction  type  algorithms,  on 
the  other  hand,  are  highly  parallelizable.  In  the  next  section  we  describe  partitioning  schemes  for 
extracting  paral'elism  from  the  Thomas  algorithm  in  the  context  of  ARC-3D  and  show  that  the 
available  parallelism  is  sufficient  for  the  machine  model  assumed  here. 

For  the  block  tri-diagonal  case,  each  of  the  coefficients  A  and  C  is  a  dense  5x5  block,  matrix 
B  is  a  5  x  5  diagonal  block,  and  Uj  and  f}  are  vectors  of  length  5.  The  computed  coefficients  P} 
and  <7;  are,  respectively,  dense  5x5  matrix  and  5x1  vector.  Thus,  in  addition  to  the  cost  of 
computing  the  coefficients,  the  corresponding  Thomas  algorithm  involves  factoring  a  5  x  5  matrix 
and  computing  six  matrix- vector  products  at  each  grid  point.  The  scalar  penta-diagonal  case  has 
five  scalar  coefficients  instead  of  three  for  each  variable  at  each  grid  point.  The  corresponding 
Thomas  algorithm  solves  penta-diagonal  systems  in  a  similar  fashion  as  the  tri-diagonal  system. 
Note  that  the  dependencies  are  in  the  form  of  the  second  order  recurrences.  For  each  variable,  the 
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forward  sweep  computes  three  coefficients  and  the  computation  of  Uj  in  the  back  sweep  involves 
values  at  uJ+i  and  uJ+ 2. 

In  RIIS,  the  fourth  order  spatial  discretization  of  the  right  hand  sides  leads  to  13  point  stencils  at 
the  interior  points,  modified  to  smaller  stencils  near  the  boundary.  These  calculations  are  performed 
at  each  grid  point  and  they  involve  explicit  data  dependencies,  i.e. ,  they  use  known  information. 
Therefore,  the  computations  for  RHS  may  be  completed  independently.  The  implementation  is 
similar  to  Jacobi  relaxation  and  the  computations  may  be  performed  in  any  order. 

In  BC,  the  inflow/outflow  boundary  points  can  also  be  updated  independently,  but  the  values 
at  boundary  points  along  the  solid  body  are  coupled  and  must  be  updated  using  tridiagonal  solves 
in  both  the  (  and  t]  directions.  We  use  the  scalar  Thomas  algorithm  for  these  tridiagonal  solves. 

8  Parallel  Implementation 

Efficient  parallel  implementation  requires  extraction  of  maximum  parallelism  with  negligible  com¬ 
munication  overhead.  This  entails  partitioning  of  the  data  across  processors  and  rearranging  the 
computations  so  that  the  load  is  evenly  balanced  and  the  communication  overhead  is  kept  to  a 
minimum.  In  practice  achieving  both  the  goals  simultaneously  is  difficult.  Furthermore,  since  the 
data  dependencies  of  BC,  RHS,  and  IMPLICIT  are  not  the  same,  reaching  this  goal  is  even  harder. 
We  describe  here  a  class  of  partitioning  schemes  that  are  suitable  for  MIMD  architectures  where  the 
number  of  processors  is  small  compared  to  the  number  of  grid  points.  For  the  following  discussion 
assume  that  p  processors  are  available  and  that  the  computational  domain  is  an  l  X  m  X  n  grid. 


1-D  partitioning  2-D  partitioning  3-D  partitioning 


Figure  3:  Uni-partition  schemes. 


8.1  Uni-partition  Schemes 

We  refer  to  a  partitioning  scheme  as  a  uni- partition  scheme  when  the  domain  of  computation  is 
subdivided  into  p  partitions  and  one  partition  is  assigned  to  each  processor.  There  are  several 
ways  of  partitioning  the  domain  in  this  manner.  The  simplest  and  the  most  commonly  used  are 
those  where  the  domain  is  partitioned  evenly,  or  almost  evenly,  along  one,  two,  and  three  spatial 
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dimensions.  If  the  computational  domain  has  at  least  p  grid  points  in  one  of  the  spatial  directions 
then  partitioning  the  grid  into  p  slices  along  that  direction  gives  the  1-D  uni-partition  scheme.  If 
the  number  of  grid  points  on  a  plane  is  at  least  equal  to  p,  then  the  2-D  uni-partitioning  scheme 
is  possible.  Here  the  grid  is  partitioned  along  two  spatial  directions.  Similarly,  in  the  3-D  uni¬ 
partitioning  scheme,  the  domain  is  divided  along  all  the  three  dimensions.  See  Figure  3. 

The  3-D  partitioning  scheme  has  the  smallest  surface  to  volume  ratio  whereas  the  1-D  scheme 
has  the  largest.  Thus,  for  RHS,  where  the  data  dependencies  are  local,  the  3-D  partitioning  scheme 
reduces  the  volume  of  data  communicated.  However,  in  this  scheme,  some  processors  may  have 
to  communicate  with  at  least  six  other  processors  and  possibly  with  up  to  12  other  processors. 
This  scheme  is  also  unattractive  if  a  three  dimensional  mesh  cannot  be  mapped  efficiently  on  the 
underlying  inter-processor  communication  network.  Furthermore,  the  use  of  13  point  stencil  in 
RHS  makes  it  necessary  to  maintain  a  buffer  containing  two  rows  or  two  columns  worth  of  extra 
information  from  the  neighboring  partitions  at  each  surface  of  the  partition  that  is  adjacent  to 
another  partition.  This  memory  overhead  increases  as  the  number  of  grid  points  per  processor 
decreases.  In  the  extreme  case  where  one  grid  point  is  assigned  to  a  processor,  the  information  at 
the  surrounding  12  grid  points  may  have  to  be  buffered. 


Sequential  Delay 

Memory  Required 

0(n) 

0(n3/p ) 

0(n2) 

0{n2/p ) 

0(n3) 

0{n/p ) 

Figure  4:  The  pipelining  delay  and  memory  requirement  tradeoff. 


For  the  IMPLICIT  part,  the  1-D  partitioning  scheme  requires  inter-processor  communication 
in  only  one  of  the  three  directions.  Thus,  for  the  1-D  partitioning  scheme,  the  adverse  effect  of  the 
data  dependencies  is  restricted  along  only  one  direction.  The  effect  of  data  dependencies  may  be 
further  reduced  by  pipelining  the  forward  and  back  sweeps  of  IMPLICIT,  as  observed  in  (Johnson, 
Saad  and  Schultz, [14]).  The  larger  the  number  of  lines  that  can  be  pipelined,  the  lesser  the  effect  of 
sequential  data  dependencies  present  in  the  Thomas  algorithm.  If  there  is  no  pipelining,  then  each 
of  the  line  solves  is  done  sequentially,  irrespective  of  the  number  of  processors  used.  However,  if  all 
the  line  solves  are  pipelined,  then  the  net  effect  is  that  the  computation  may  be  completed  in  time 
equal  to  one  line  solve  performed  sequentially  plus  the  time  required  to  compute  the  remaining  line 
solves  with  perfect  parallelism.  This  cost  can  be  further  reduced  by  starting  the  computations  from 
both  sides  simultaneously.  This  reduces  the  sequential  cost  of  filling  the  pipe  by  half.  In  the  2-D 
and  3-D  partitioning  schemes,  the  data  dependencies  of  IMPLICIT  are  spread  across  processors 
along  two  and  three  dimensions,  respectively.  This  gives  rise  to  data  dependency  delays  in  two  and 
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three  directions.  Thus,  if  no  pipelining  of  computations  is  performed,  then  the  3-D  partitioning 
scheme  is  the  most  adversely  affected  and  has  very  little  parallelism.  With  full  pipelining,  the  2-D 
and  3-D  cases  have  a  delay  of  two  and  three  sequential  line  solves,  respectively.  Note  from  Table  1, 
that  for  the  type  of  problems  considered  here,  the  cost  of  one  sequential  line  solve  can  be  very  high 
and  may  quickly  dominate  the  total  parallel  cost.  This  is  particularly  true  when  the  number  of  grid 
points  assigned  to  each  processor  is  small.  By  assigning  a  sufficiently  large  number  of  grid  points 
to  each  processor,  one  may  reduce  the  effect  of  the  sequential  part  in  the  pipelined  computation. 

However,  there  is  a  penalty  to  be  paid  in  terms  of  memory.  In  order  to  be  able  to  pipeline  the 
computations,  it  is  necessary  to  store  the  intermediate  results  of  all  the  forward  solves  that  are 
pipelined.  This  entails  a  memory  overhead  of  storing  an  extra  5x5  matrix  at  each  grid  point  for 
the  block  tri-diagonal  case.  In  the  penta-diagonal  case  the  memory  overhead  is  eleven  coefficients 
per  grid  point  along  each  direction.  This  gives  rise  to  an  interesting  tradeoff  between  the  memory 
requirement  and  the  benefit  of  pipelining  the  IMPLICIT  computations.  See  Figure  4. 

Under  the  uni-partitioning  scheme,  the  substructuring  or  the  cyclic  reduction  type  algorithms 
may  also  be  used  in  IMPLICIT.  These  algorithms  have  higher  degree  of  available  parallelism,  but 
there  is  a  penalty  of  almost  doubling  the  computation  cost.  Thus,  the  1-D  and  2-D  partitioning 
schemes  with  substructuring  generally  fair  poorly  as  compared  to  the  Thomas  based  IMPLICIT. 

For  3-D  partitioning  schemes,  the  communications  is  global  in  all  the  three  dimensions,  making 
these  schemes  communication  intensive. 

In  Table  2,  some  of  the  issues  described  above  are  quantified  for  computing  scalar  tri-diagonals 
along  all  the  three  dimensions  of  an  n  X  n  X  n  grid  and  using  p  processors,  neglecting  lower  order 
terms.  These  complexities  assume  p  <  n,  though  the  complexities  for  the  2-D  Pipelined  and  3-D 
Diagonal  scheme  remain  valid  for  p  <  n2  and  the  3-D  complexity  is  valid  up  to  p  <  n3.  The 
multi-partition  schemes  are  described  in  the  next  section.  The  pipelined  methods  use  the  Thomas 
algorithm  with  1-D,  2-D  and  3-D  partitioning.  The  arithmetic  operations  count  gives  the  total 
parallel  computation  cost,  including  the  dependency  delays.  The  column  ‘words  copied’  gives  the 
amount  of  data  that  must  copied  into  the  memory  because  of  the  partitioning  scheme  used.  The 
number  of  messages  is  the  minimum  number  of  messages  that  must  be  encountered  in  a  sequence 
from  the  beginning  to  the  end  of  the  computation.  This  sequence  of  messages  represents  minimum 
communication  delay  in  that  partitioning  scheme  because  of  the  message  overhead.  This  number 
takes  into  account  all  the  messages  sent  by  all  the  processors  in  parallel.  The  message  size  is  in 
terms  of  the  number  of  words  per  message.  For  Thomas  based  algorithms,  two  words  must  be  sent 
per  line  in  the  forward  sweep  and  one  word  in  the  back  sweep.  The  message  size  for  the  forward 
sweep  could  be  increased  to  2  words,  reducing  the  number  of  messages  sent  in  the  forward  sweep 
by  a  factor  of  2.  The  substructure  algorithm  uses  1-D  partitioning,  with  the  Thomas  algorithm  in 
the  two  directions  which  need  no  communication  and  the  substructure  algorithm  in  the  direction 
which  crosses  processor  boundaries.  < 

There  are  two  major  drawbacks  of  the  uni-partitioning  schemes  described  above.  First,  load 
balancing  is  difficult.  In  practice,  the  problems  to  be  solved  may  have  grid  sizes  that  cannot  be 
evenly  divided  among  the  processors.  Thus,  if  the  granularity  of  the  load  distribution  is  to  be 
retained  at  grid  level,  then  some  partitions  may  end  up  with  a  higher  number  of  grid  points  to 
compute  than  the  others.  Since  the  computations  per  grid  point  are  significant,  the  effect  of  this 
load  imbalance  may  be  considerable.  This  is  especially  true  when  the  number  of  processors  is 
large  and  the  number  of  grid  points  assigned  to  a  processor  is  relatively  small.  For  the  partitioning 
schemes  described  above,  the  problem  is  compounded  by  the  fact  that  the  load  imbalance  may  be  of 
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the  order  of  lines  of  grid  points  or  even  planes  of  grid  points.  This  adversely  affects  the  performance 
irrespective  of  whether  only  a  small  number  of  processors  or  a  large  number  of  processors  are  used. 
If  non-standard  partitions  are  used,  then  the  load  may  be  better  balanced,  but  the  communication 
and  programming  costs  may  turn  out  to  be  unreasonable.  The  second  drawback  applies  even  when 
the  grid  can  be  evenly  divided  among  the  processors.  This  is  because  of  the  uneven  distribution 
of  the  computational  work  at  the  grid  points.  The  interior  points  which  form  the  major  bulk 
of  the  computation  have  higher  amounts  of  computational  work  associated  with  them  than  the 
grid  points  on  the  boundary.  With  this  unevenness,  any  grid-level  partitioning  scheme  results  in 
a  load  imbalance.  Furthermore,  the  data  dependencies  of  BC,  RHS,  and  IMPLICIT  are  totally 
different  and  BC  applies  only  to  the  boundary  points,  RHS  is  applied  over  all  the  grid  points, 
and  IMPLICIT  is  applied  only  to  the  interior  points.  This  adds  to  the  data  dependence  delays 
even  if  the  computational  work  is  perfectly  divided  among  the  processors,  since  the  processors  with 
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partitions  that  include  the  boundary  grid  points  are  required  to  work  on  BC  and  RHS  while  those 
with  only  the  interior  points  are  not.  One  way  to  reduce  the  detrimental  effects  of  load  imbalance 
is  to  assign  multiple  partitions  to  each  processor.  This  is  described  in  the  next  section. 


8.2  Multi-partition  Scheme 


In  a  multi- partition  scheme,  the  computational  domain  is  divided  into  a  number  of  sub-partitions 
that  is  larger  than  the  number  of  processors  and  each  processor  is  assigned  more  than  one  such  sub¬ 
partition.  An  obvious  advantage  is  that  the  granularity  of  each  sub-partition  may  be  made  much 
smaller  than  that  in  the  uni-partition  scheme.  This  gives  better  control  over  the  computational 
load  distribution.  Moreover,  as  we  describe  next,  not  only  is  it  possible  to  better  balance  the  overall 
computational  work,  but  also  the  load  may  be  balanced  at  the  task  level  so  that  the  additional 
data  dependency  delays  are  not  introduced.  In  the  following,  we  describe  one  such  scheme  that 
we  refer  to  as  the  diagonal  partitioning  scheme.  A  partitioning  scheme  similar  to  the  2-D  vers:or 
presented  below  is  described  by  Johnsson  et  al.  for  implementing  ADI  methods  on  multiprocessor 
systems  [14]. 
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Figure  5:  Multi-partition  schemes. 


In  the  diagonal  partitioning  scheme,  the  domain  is  partitioned  in  a  manner  similar  to  that  of 
the  2-D  or  the  3-D  uni-partitioning  scheme  described  above.  We  refer  to  these  as  the  2-D  diagonal 
and  3-D  diagonal  partitioning  schemes,  respectively.  In  the  former  case,  the  domain  is  subdivided 
into  p2  sub-partitions  and  in  the  later  case  it  is  subdivided  into  p3/2  sub-partitions.  In  the  2-D 
diagonal  partitioning  scheme,  each  processor  is  assigned  one  sub-partition  from  each  row  and  from 
each  column  of  sub-partitions.  Similarly  in  the  3-D  diagonal  partitioning  scheme,  each  processor  is 
assigned  one  sub-partition  from  each  plane  of  sub-partitions.  This  arrangement  results  in  assigning 
sub-partitions  to  a  processor  that  form  diagonal  chains  in  two  or  three  dimensions.  Note  that,  for 
the  2-D  diagonal  partitioning  scheme  to  be  applicable,  the  computational  grid  should  have  at  least 
p  grid  points  on  each  side  of  one  of  the  2-D  planes  and  for  the  3-D  diagonal  partitioning  case  the 
computational  grid  should  have  at  least  p1/2  grid  points  along  each  of  the  three  dimensions. 

The  above  described  diagonal  partitioning  and  assignment  schemes  require  each  processor  to 
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compute  all  the  three  tasks:  BC,  FHS,  and  IMPLICIT.  Because  each  processor  is  assigned  one  sub- 
pa:tition  from  each  row  and  column  of  sub-partitions,  the  computation  is  much  better  balanced 
even  when  the  number  of  grid  points  along  each  dimension  of  the  grid  is  not  evenly  divisible  by  p 
or  p1/2.  An  interesting  aspect  of  the  2-D  diagonal  partitioning  scheme  is  that  any  processor  may 
need  to  communicate  with  only  two  other  processors  even  though  there  are  four  sub-partitions 
surrounding  each  sub-partition.  In  the  3-D  case,  each  processor  needs  to  communicate  with  six 
other  neighbors. 

The  d  iagonal  schemes,  however,  have  significantly  larger  surface  area  that  is  adjacent  to  other 
sub-partitions.  This  results  in  higher  communication  requirements.  The  communication  and  com¬ 
putation  requirements  in  the  scalar  tridiagonal  case  for  the  2-D  diagonal  partitioning  scheme  are 
shown  in  Table  2. 

If  the  communication  overhead  is  not  substantial  then  the  diagonal  partitioning  schemes  have  a 
clear  advantage  over  the  2-D  and  3-D  uni-partitioning  schemes  described  earlier.  This  can  be  seen 
by  observing  the  maximum  achievable  speedups.  For  example,  with  the  3-D  diagonal  partitioning 
scheme  applied  to  an  n  X  n  x  n  grid  and  using  n2  processors,  the  maximum  achievable  speedup  is 
n2 .  The  same  with  the  2-D  uni-partitioning  scheme  is  3n2/5  and  with  3-D  uni-partitioning  scheme 
it  is  n2/2. 

Another  variation  of  multi-partitioning  scheme  is  the  case  where  the  domain  is  partitioned  into 
p2  sub-partitions  as  in  the  case  of  2-D  uni-partitioning  scheme,  but  the  partitions  are  switched 
during  the  computation  of  IMPLICIT.  Here,  each  processor  is  initially  assigned  p  sub-partitions 
all  from  the  same  row  of  sub-partitions.  After  completing  the  IMPLICIT  in  two  of  the  directions 
that  lie  in  the  plane  of  sub-partitions  assigned  to  the  processor,  the  processors  transpose  the  sub¬ 
partitions  before  working  on  the  final  dimension  of  IMPLICIT.  With  this  scheme,  there  is  no 
communication  during  the  IMPLICIT  computation  except  for  the  transpose.  The  advantage  is 
that  there  are  no  data  dependency  delays.  The  costs  associated  with  this  scheme  for  the  scalar 
tri-diagonal  case  are  shown  in  Table  2. 

9  Implementation  on  Victor 

We  now  describe  some  of  the  results  from  the  implementations  on  the  Victor  multiprocessor  system 
developed  at  T.  J.  Watson  Research  Center,  Yorktown  Heights.  Victor  is  a  message  passing  MIMD 
architecture  with  Inmos  T800  transputer  as  the  processing  unit.  Each  node  is  associated  with  4 
MB  DRAM  and  is  connected  to  four  other  nodes  forming  a  2-dimensional  grid  interconnection. 
Victor  is  a  modular  architecture  that  can  be  configured  from  16  to  256  nodes  (Wilcke  et.  al., [28]). 
All  the  implementations  were  made  using  the  Inmos  3L  parallel  Fortran  environment.  The  3L- 
supported  thread  routines  were  used  to  mimic  an  asynchronous  communication  environment  for 
overcoming  some  of  the  difficulties  posed  by  the  synchronous  communication  supported  by  the 
transputer  hardware. 

For  all  the  results  presented  here,  we  use  the  flow  past  a  semi-infinite  hemisphere-cylinder 
body  as  a  model  problem.  The  computations  were  performed  on  a  grid  with  30  points  in  the 
axial  direction,  12  points  circumferentially  and  30  points  in  the  normal  direction.  Figures  6  and  7 
show  the  body  and  the  associated  grid  in  the  curvilinear  coordinates.  All  the  computations  were 
performed  with  double  precision  arithmetic. 

We  have  considered  three  different  partitioning  schemes  for  both  the  block  tri-diagonal  and  the 
scalar  penta-diagonal  versions.  These  are  the  1-D  and  2-D  uni-partitioning  schemes,  and  the  2-D 


Figure  6:  The  hemisphere-cylinder  body. 


diagonal  partitioning  scheme.  In  all  the  cases  the  partitioning  of  the  computational  domain  was 
along  the  axial  and  the  normal  directions  only. 


(1,1) 

(2,2) 

(3,3) 

(4,4) 

BC 

.988 

.501 

.385 

.317 

RHS 

30.1 

8.42 

4.26 

2.44 

IMPLICIT 

199. 

51.4 

26.4 

13.2 

TOTAL 

230. 

60.3 

31.0 

15.9 

Table  3:  Timings  for  the  block  tridiagonal  version,  using  1,  4,  9  and  16  processors  (in  seconds  per 
time  step). 

The  results  from  the  implementations  of  the  scalar  penta-diagonal  and  block  tri-diagonal  ver¬ 
sions  of  ARC-3D  using  the  2-D  uni-partitioning  schemes  are  compared  in  Tables  3  through  6.  In 
all  the  cases  the  timings  are  given  for  each  of  the  three  tasks  for  four  different  processor  config¬ 
urations.  These  are  a  1  processor  (1,1),  4  processors  (2,2),  9  processors  (3,3),  and  16  processors 
(4,4).  In  Tables  3  and  4  actual  wall  clock  timings  per  time  step  are  given.  The  speedups  over  one 
processor  timings  are  given  in  Tables  5  and  6.  It  can  seen  that,  though  the  block  tridiagonal  version 
is  more  expensive  than  the  penta-diagonal  version,  it  is  more  efficient  in  extracting  the  available 
parallelism. 

An  interesting  observation  is  the  comparison  of  the  performance  of  9  processors  with  that  of 
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Figure  7:  A  cross-section  of  the  grid  in  curvilinear  coordinates. 


(1.1) 

(2,2) 

(3,3) 

(4,4) 

BC 

.988 

.560 

.426 

.438 

RHS 

30.2 

8.52 

4.36 

2.57 

IMPLICIT 

64.6 

17.1 

8.93 

4.49 

TOTAL 

95.8 

26.2 

_ 

13.7 

7.50 

Table  4:  Timings  for  the  penta-diagonal  version,  using  1,  4,  9  and  16  processors  (in  seconds  per 
time  step). 


16  processors.  For  the  problem  size  considered  here  and  for  the  2-D  uni-partitioning  schemes,  the 
interior  points  are  evenly  divisible  among  the  16  processors,  but  are  not  divisible  equally  among  9 
processors.  The  effect  of  this  load  imbalance  is  clear  from  the  Tables  5  and  6. 

The  relative  performance  of  the  three  tasks  BC,  RHS,  and  IMPLICIT  are  also  to  be  noted. 
As  expected,  BC  performs  poorly,  because  most  of  the  computation  in  BC  is  along  the  solid  body 
which  is  computed  by  only  4  processors  in  the  case  of  16  processor  implementation  and  by  only  3 
processors  in  the  case  of  9  processor  implementation.  RHS,  although  easily  parallelizable,  performs 
poorly  as  compared  to  IMPLICIT  mainly  because  of  the  load  imbalance.  The  partitions  divide 
the  interior  points  evenly,  but  this  is  not  true  when  the  boundary  points  are  also  included.  In 
RHS,  computations  are  performed  at  the  interior  as  well  as  at  the  boundary  points  which  affects 
its  performance.  IMPLICIT  has  only  the  interior  points  to  compute  and  it  performs  very  well. 


Even  here  there  is  a  slight  load  imbalance  that  is  not  obvious.  The  load  imbalance  is  caused  by 
the  fact  that  the  computations  at  the  first  point  of  a  line  solve  are  considerably  less  than  those  at 
the  rest  of  the  interior  points.  However,  the  sequential  effect  of  the  pipelining  of  IMPLICIT  masks 
this  imbalance. 

The  diagonal  multi-partitioning  schemes  are  much  harder  to  implement  than  any  of  the  uni¬ 
partition  schemes.  The  difficulty  arises  in  efficiently  managing  the  multiple  sub-partitions  assigned 
to  each  processor.  Since  the  number  of  sub-partitions  assigned  to  a  processor  increases  as  the  total 
number  of  processors  working  on  the  problem  increases,  the  overhead  for  large  number  of  processors 
can  be  very  high,  especially  in  the  2-D  diagonal  case  where  the  relative  gains  are  small. 


(1,1) 

(2,2) 

(3,3) 

(M) 

BC 

1.00 

1.97 

2.57 

3.12 

RHS 

1.00 

3.57 

7.07 

12.3 

IMPLICIT 

1.00 

3.87 

7.54 

15.1 

TOTAL 

1.00 

3.81 

_ 

7.42 

14.5 

Table  5:  Speedups  for  the  block  tridiagonal  version,  using  1,  4,  9  and  16  processors. 

Currently,  we  are  in  the  process  of  completing  this  implementation  and  the  results  seen  so  far 
are  encouraging.  For  the  2-D  diagonal  version,  the  assignment  scheme  results  in  each  processor 
having  to  communicate  with  only  two  other  processors.  These  neighboring  processors  remain 
the  same  throughout  the  entire  computation.  Thus,  if  the  processors  can  be  arranged  in  a  loop, 
the  2-D  diagonal  version  can  be  mapped  onto  these  processors  which  results  in  nearest  neighbor 
communication  for  BC,  RHS,  and  IMPLICIT.  Mapping  a  loop  onto  a  mesh  is  relatively  easy. 
However,  to  retain  nearest  neighbor  communication  only  the  loops  with  even  number  of  processors 
are  permissible. 


(1,1) 

(2,2) 

(3,3) 

(4,4) 

BC 

1.00 

1.76 

2.32 

2.26 

RHS 

1.00 

3.54 

6.93 

11.8 

IMPLICIT 

1.00 

3.78 

7.23 

14.4 

TOTAL 

1.00 

3.66 

6.99 

12.8 

Table  6:  Speedups  for  the  penta-diagonal  version,  using  1,  4,  9  and  16  processors. 


10  Conclusions 

We  have  attempted  to  show,  by  analysis  and  experimentation,  the  extent  to  which  CFD  applications 
based  on  implicit  scheme  can  be  parallelized.  A  specific  example  of  an  implicit  scheme  involving 
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line  solves  in  all  three  coordinate  directions  was  considered.  It  is  observed  that  the  affect  of  the 
various  data  dependencies  of  the  different  parts  of  the  algorithm  adversely  affect  the  parallelism, 
but  these  can  be  minimized  with  an  appropriate  partitioning  strategy. 
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